Making sense of Medstat.dk data

code
analysis
exploration
cost
drugs
Author

Henrik Vitus Bering Laursen

Published

September 19, 2024

Exploring Medstat data: Drug prescriptions in Denmark

In this post, I’ll showcase how to work with publicly available drug prescription data from medstat.dk. This is a great way to practice data manipulation and analysis, using real-world data.

Downloading the data

You can download all the datasets from medstat.dk’s download section. If you want to download multiple files and not have to work with the URLs, I recommend using the downloadthemall extension to batch download the data for easier local manipulation.

Since I’m using GitHub to store my work, I’ll download the data directly from Medstat. GitHub has a 2GB storage limit per repository, which means large public datasets can quickly exceed this. Therefore, it’s important to manage the size of the files efficiently.

Loading and processing the data

Good old trusty tidyverse has a lot of what I need, specifically the readr package, “secretly” loaded the library in an invisible chunk below, with #| include: FALSE, but also #| warning: FALSE because it wants to warn me about conflicts with the tidyverse package.

#| include: FALSE
#| warning: FALSE
library(tidyverse)

Fetching the data from medstat

You can use tools like the MarkDownload extension to find the download links from the resulting markdown file it hands you.

Now lets load an example dataset.

# Define the URL for the dataset ()
url_atc_2023 <- "https://medstat.dk/da/download/file/MjAyM19hdGNfY29kZV9kYXRhLnR4dA=="

# Read the dataset directly from the URL
df_atc_2023 <- read_delim(url_atc_2023, delim = ";", show_col_types = FALSE)  # Assuming the file is tab-delimited

# Print the first few rows of the dataset
head(df_atc_2023)
# A tibble: 6 × 14
  A...1 `2023` `0...3` `0...4` A...5 A...6  ...7  ...8 `5009988` `2661589` ...11
  <chr>  <dbl>   <dbl>   <dbl> <chr> <chr> <dbl> <dbl>     <dbl>     <dbl> <dbl>
1 A       2023       0       0 2     00-17 25116  44.6     32451     21308    NA
2 A       2023       0       0 1     00-17 21962  37.0     40295     29871    NA
3 A       2023       0       0 0     00-17 47080  40.7     72746     51179    NA
4 A       2023       0       0 2     18-24 23313  92.3     51797     21707    NA
5 A       2023       0       0 1     18-24 13137  50.1     29808     15882    NA
6 A       2023       0       0 0     18-24 36450  70.8     81606     37590    NA
# ℹ 3 more variables: ...12 <dbl>, `94` <chr>, ...14 <lgl>

Ok, this data is without headers. A way to save space probably.

Understanding the data structure

Thankfully there is a documentation file the website: Downloadbeskrivelse medstat. This file explains the strucutre of the datasets and for the data that we look at, YYYY_atc_code_data.txt, the following variables apply (translated to English from the original Danish):

  • atc: Anatomical Therapeutic Chemical code
  • year: Year of data
  • sector: Healthcare sector
  • region: Geographic region
  • sex: Gender
  • agegroup: Age category
  • count_persons: Number of individuals
  • count_persons_per1kpop: Number of individuals per 1,000 population
  • turnover: Sales turnover
  • reimbursement: Reimbursements provided
  • sold_amount: Amount of drugs sold
  • sold_amount_1kpop_day: Amount sold per 1,000 population per day
  • personreferabledata_perc: Percentage of data referable to individuals

So we can use colnames() to transfer these column names to our dataset.

# attach colnames
colnames(df_atc_2023) <- c("atc","year","sector","region","sex","agegroup","count_persons","count_persons_per1kpop","turnover","reimbursement", "sold_amount", "sold_amount_1kpop_day", "personreferabledata_perc")

# View data
head(df_atc_2023)
# A tibble: 6 × 14
  atc    year sector region sex   agegroup count_persons count_persons_per1kpop
  <chr> <dbl>  <dbl>  <dbl> <chr> <chr>            <dbl>                  <dbl>
1 A      2023      0      0 2     00-17            25116                   44.6
2 A      2023      0      0 1     00-17            21962                   37.0
3 A      2023      0      0 0     00-17            47080                   40.7
4 A      2023      0      0 2     18-24            23313                   92.3
5 A      2023      0      0 1     18-24            13137                   50.1
6 A      2023      0      0 0     18-24            36450                   70.8
# ℹ 6 more variables: turnover <dbl>, reimbursement <dbl>, sold_amount <dbl>,
#   sold_amount_1kpop_day <dbl>, personreferabledata_perc <chr>, `` <lgl>
glimpse(df_atc_2023)
Rows: 1,806,544
Columns: 14
$ atc                      <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", …
$ year                     <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 202…
$ sector                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ region                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ sex                      <chr> "2", "1", "0", "2", "1", "0", "2", "1", "0", …
$ agegroup                 <chr> "00-17", "00-17", "00-17", "18-24", "18-24", …
$ count_persons            <dbl> 25116, 21962, 47080, 23313, 13137, 36450, 124…
$ count_persons_per1kpop   <dbl> 44.61, 37.04, 40.73, 92.26, 50.10, 70.79, 169…
$ turnover                 <dbl> 32451, 40295, 72746, 51797, 29808, 81606, 419…
$ reimbursement            <dbl> 21308, 29871, 51179, 21707, 15882, 37590, 104…
$ sold_amount              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ sold_amount_1kpop_day    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ personreferabledata_perc <chr> "94", "94", "94", "94", "94", "94", "94", "94…
$ NA                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Missing name for one column that doesnt fit with the “Downloadbeskrivelse medstat”. What is it?

# call it something
colnames(df_atc_2023)[is.na(colnames(df_atc_2023))] <- "missing_name"

# table of content
table(df_atc_2023$missing_name)
< table of extent 0 >

This column is empty. Maybe just an artifact from the delim file. Lets assume we can safely remove it.

df_atc_2023 <- df_atc_2023 |>  select(-missing_name)

Organising the data

Time to figure out how this data is structured, and organise the character variables into factors. Factors, as you may know, are a data structure in R, which is akin to a categorisation, ordered or unordered. Since we have age, we probably have at least some age categories that are ordered in intervals. Lets find out what the different variables likely to be factors are.

# Finding all unique values even though i could just consult "Downloadbeskrivelse medstat"
unique_values_all_columns <- unique(unlist(sapply(df_atc_2023, function(x) if(is.character(x)) unique(x))))
print(unique_values_all_columns)

Woops. I have hidden the massive output of this via #| results: 'hide'. I forgot that ATC has QUITE a lot of unique values. Lets make a data frame without the atc value.

# Remove atc
df_atc_2023_mini <- df_atc_2023 |>  select(-atc)

# Go through all variables that could be categories (often stored as characters by default) and display unique values
unique(unlist(sapply(df_atc_2023_mini, function(x) if(is.character(x)) unique(x))))
  [1] "2"     "1"     "0"     "A"     "00-17" "18-24" "25-44" "45-64" "65-79"
 [10] "80+"   "000"   "001"   "002"   "003"   "004"   "005"   "006"   "007"  
 [19] "008"   "009"   "010"   "011"   "012"   "013"   "014"   "015"   "016"  
 [28] "017"   "018"   "019"   "020"   "021"   "022"   "023"   "024"   "025"  
 [37] "026"   "027"   "028"   "029"   "030"   "031"   "032"   "033"   "034"  
 [46] "035"   "036"   "037"   "038"   "039"   "040"   "041"   "042"   "043"  
 [55] "044"   "045"   "046"   "047"   "048"   "049"   "050"   "051"   "052"  
 [64] "053"   "054"   "055"   "056"   "057"   "058"   "059"   "060"   "061"  
 [73] "062"   "063"   "064"   "065"   "066"   "067"   "068"   "069"   "070"  
 [82] "071"   "072"   "073"   "074"   "075"   "076"   "077"   "078"   "079"  
 [91] "080"   "081"   "082"   "083"   "084"   "085"   "086"   "087"   "088"  
[100] "089"   "090"   "091"   "092"   "093"   "094"   "95+"   "T"     "94"   
[109] NA      "92"    "95"    ">99"   "86"    "15"    "18"    "60"    "96"   
[118] "19"    "98"    "97"    "91"    "11"    "20"    "10"    "83"    "65"   
[127] "99"    "8"     "55"    "46"    "40"    "36"    "44"    "51"    "45"   
[136] "54"    "50"    "48"    "89"    "81"    "34"    "57"    "53"    "87"   
[145] "39"    "41"    "9"     "64"    "56"    "93"    "42"    "70"    "75"   
[154] "61"    "84"    "26"    "43"    "59"    "90"    "14"    "28"    "30"   
[163] "52"    "32"    "3"     "27"    "69"    "16"    "71"    "6"     "25"   
[172] "67"    "78"    "77"    "23"    "5"     "22"    "13"    "72"    "17"   
[181] "21"    "<1"    "66"    "38"    "62"    "73"    "88"    "76"    "12"   
[190] "4"     "58"    "33"    "82"    "29"    "63"    "31"    "80"    "79"   
[199] "74"    "85"    "49"    "68"    "35"   

Or, in a little more legible fashion, use some apply functions to first find the ones that are characters with sapply(), and then lapply() to create a list of all unique values of the character vars.

# find character variables
char_columns <- df_atc_2023_mini[sapply(df_atc_2023_mini, is.character)]

# apply the unique() function to each character variable
unique_char_values <- lapply(char_columns, unique)
unique_char_values
$sex
[1] "2" "1" "0" "A"

$agegroup
  [1] "00-17" "18-24" "25-44" "45-64" "65-79" "80+"   "000"   "001"   "002"  
 [10] "003"   "004"   "005"   "006"   "007"   "008"   "009"   "010"   "011"  
 [19] "012"   "013"   "014"   "015"   "016"   "017"   "018"   "019"   "020"  
 [28] "021"   "022"   "023"   "024"   "025"   "026"   "027"   "028"   "029"  
 [37] "030"   "031"   "032"   "033"   "034"   "035"   "036"   "037"   "038"  
 [46] "039"   "040"   "041"   "042"   "043"   "044"   "045"   "046"   "047"  
 [55] "048"   "049"   "050"   "051"   "052"   "053"   "054"   "055"   "056"  
 [64] "057"   "058"   "059"   "060"   "061"   "062"   "063"   "064"   "065"  
 [73] "066"   "067"   "068"   "069"   "070"   "071"   "072"   "073"   "074"  
 [82] "075"   "076"   "077"   "078"   "079"   "080"   "081"   "082"   "083"  
 [91] "084"   "085"   "086"   "087"   "088"   "089"   "090"   "091"   "092"  
[100] "093"   "094"   "95+"   "T"     "A"    

$personreferabledata_perc
 [1] "94"  NA    "92"  "95"  ">99" "86"  "15"  "18"  "2"   "60"  "96"  "19" 
[13] "98"  "97"  "91"  "11"  "20"  "10"  "83"  "65"  "99"  "8"   "55"  "46" 
[25] "40"  "36"  "44"  "51"  "45"  "54"  "50"  "48"  "89"  "81"  "34"  "57" 
[37] "53"  "87"  "39"  "41"  "9"   "64"  "56"  "93"  "42"  "70"  "75"  "61" 
[49] "84"  "26"  "43"  "59"  "90"  "14"  "1"   "28"  "30"  "52"  "32"  "3"  
[61] "27"  "69"  "16"  "71"  "6"   "25"  "67"  "78"  "77"  "23"  "5"   "22" 
[73] "13"  "72"  "17"  "21"  "<1"  "66"  "38"  "62"  "73"  "88"  "76"  "12" 
[85] "4"   "58"  "33"  "82"  "29"  "63"  "31"  "80"  "79"  "74"  "85"  "49" 
[97] "68"  "35" 

More manageable amount of unique values. Although it seems there are multiple age categories within the agegroup variable. I will remove df_atc_2023_mini and unique_char_values as they were created for the purpose of creating an overview of the unique values.

rm(df_atc_2023_mini, unique_char_values)

Lets focus on the unique age groups. The data for age groups must be duplicated for the ranges 000 to 017, and the age category “00-17”. This would not be included if the goal was to save space, as a single variable containing the age from 000 to 95+ could be grouped into intervals with a mutate() or cut() function.

From here, I will use dplyr, as it is easier to read the operations performed on the data.

I will explore the following questions:

  1. Are the age range equal to the age intervals?
  2. What do the sex categories represent?
  3. Exactly to what degree is the dataset full of duplicated data?

Age range versus intervals

To figure out whether the age range versus the prespecified intervals have the same data, we can summarise a variable, here turnover, for the two groups, and see if they are the same.

First, lets summarize turnover for the age interval “00-17”.

# Make a datset by filtering on 00-17
df_atc_2023_interval <- df_atc_2023 |>
  filter(agegroup == "00-17")

# Summarise it
turnover_sum_interval <- df_atc_2023_interval |>
  summarize(turnover = sum(turnover, na.rm = TRUE))

15.373.244 in 1000 DKK, so 15.373.244.000, or more than 15 billion DKK in 2023. For one age interval. Woav.

Now, lets try to do it for the age range.

# make vector to include in the filter
numbers <- sprintf("%03d", 0:17)

# Now filter in the same way as above, but for 000 to 017
df_atc_2023_range <- df_atc_2023 |>
  filter(agegroup %in% numbers)

# Summarise
turnover_sum_range  <- df_atc_2023_range |> mutate(
    agegroup_combined =
      ifelse(
        agegroup %in% numbers,
        "000-017",
        agegroup)) |>
  ungroup() |>
  summarize(turnover = sum(turnover, na.rm = TRUE)) #, .groups = "drop")
turnover_sum_range
# A tibble: 1 × 1
  turnover
     <dbl>
1 14857936

Now lets compare.

# Summary
print(turnover_sum_interval)
# A tibble: 1 × 1
  turnover
     <dbl>
1 15373247
print(turnover_sum_range)
# A tibble: 1 × 1
  turnover
     <dbl>
1 14857936
# There is a difference.. How much?
(print(turnover_sum_interval)-print(turnover_sum_range))
# A tibble: 1 × 1
  turnover
     <dbl>
1 15373247
# A tibble: 1 × 1
  turnover
     <dbl>
1 14857936
  turnover
1   515311

Okay, these two should be equal. Where did those 515.309.000 go? Thats a difference of 3.47%. The number calculated with inline R code - with “{r} round(((turnover_sum_interval-turnover_sum_range)/turnover_sum_range)*100,2)” surrounded by backticks - because that is nice to have in case you want to have your numbers in the text change accordingly with any change in the data you have done. Anyway.. what went wrong?

How very strange. How do I figure out why there is a difference?

Are there the same amount of people in each category?

# Count amount of people in each category
count_persons_range <- df_atc_2023 |>
  filter(agegroup %in% numbers) |>
  summarise(total = sum(count_persons))
count_persons_int <- df_atc_2023 |>
  filter(agegroup == "00-17") |>
  summarise(total = sum(count_persons))

# Difference
count_persons_int-count_persons_range
   total
1 131104

Ok. 131.087 people are missing from the “range” dataset. Thats a difference of 0.47% I might have done something wrong. Or there is a mistake in the data.

Moving on

Let’s drop the other questions I wanted to ask above, atleast for now. A patient Data Scientist with a lot of time on their hands should probably find the cause of the error by doing some of the following:

  • Check the discrepancies across multiple variables
  • Recheck the “Downloadbeskrivelse” to see if the metadata for the data has a description of why there would be a descrepancy

Because if it was an error, this 3.47% difference in Turnover and 0.47% difference in amount of people in the category is something I will let sit.

My main hypothesis which just popped into my head just now is that since the data is missing from the count_persons_range dataset, it has something to do with removing observations from the original dataset that contains too much person-referable [personhenførbart] data.

Now, with that in mind, I, an aspiring Data Scientist with a goal of making a blog post and not taking forever, will just use the data with age intervals, as according to my hypothesis, that is the most “complete” data.

With that dataset, I will provide a brief overview of the drug classes with the highest turnover, by age group.

Making a plot of the turnover by drug class

Now, I have what I need to create the plots of turnover. I choose to make stacked bar charts, because I am used to making them, but there are probably better ways of representing the data.

First, I prepare the data, by filtering out all the drug classes I do not wish to focus on, and making the y-axis more readable by dividing it with 1000, thus making it turnover in 1.000.000 of DKK

# Create vector of drug classes
DCs <- c("A10BJ", "A10BK", "A10BH", "A10BA", "A10BB", "A10BG", "A10BX", "A10A")

# Filter
df_atc_2023_filtered <- df_atc_2023_filtered |>
  filter(atc %in% DCs) |>
  mutate(
    DC = case_when( # New drug class variable
      atc == "A10BJ" ~ "GLP1", # GLP1
      atc == "A10BK" ~ "SGLT2", # SGLT2
      atc == "A10BH" ~ "DPP4", # DPP4
      atc == "A10BA" ~ "Metformin", # MET
      atc == "A10BB" ~ "SU", # SU
      atc == "A10BG" ~ "TZD", # Thiazolidinediones
      atc == "A10BX" ~ "Others", # Others
      atc == "A10A"  ~ "INSULIN",  # INSULIN
      TRUE ~ NA_character_  # Default to NA for all other values
    ),
    turnover1000k = turnover/1000 # 1000k because its already in 1000's.
    )

Now its time to make the plot.

# Plot - stacked bar chart
  # total turnover
p1 <- ggplot(df_atc_2023_filtered, aes(x = agegroup, y = turnover1000k, fill = DC)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Total turnover by drug class",
    caption = "Source: own calculations based on data from medstat.dk via the Danish Health Data Authority"
    )
  # propotional turnover for each group
p2 <- ggplot(df_atc_2023_filtered, aes(x = agegroup, y = turnover1000k, fill = DC)) +
  geom_bar(stat = "identity", position = "fill") +
  labs(
    y = "",
    title = "Proportional turnover by drug class",
    caption = "Source: own calculations based on data from medstat.dk via the Danish Health Data Authority"
    )

p1

ggsave("thumbnail.png", plot = last_plot(), width = 6, height = 4) # p1 saved as thumbnail

p2

Well then.. The turnover for GLP1 in this country is quite impressive, with the 45 to 64 year olds taking by far the biggest piece of the pie. Thats about 6.000.000.000 DKK in turnover for that age group in 2023 alone.

In the next blog post I aim to show how to redo all that I did, as a function.

And hopefully it will also be a bit more organised.